Abstract

Discovering the latent structure from many observed variables is an important yet challenging 
learning task. Existing approaches for discovering latent structures often require the unknown 
number of hidden states as an input. In this paper, we propose a quartet based approach which 
is agnostic to this number. The key contribution is a novel rank characterization of the tensor 
associated with the marginal distribution of a quartet. This characterization allows us to design 
a nuclear norm based test for resolving quartet relations. We then use the quartet test as a 
subroutine in a divide-and-conquer algorithm for recovering the latent tree structure. Under 
mild conditions, the algorithm is consistent and its error probability decays exponentially with 
increasing sample size. We demonstrate that the proposed approach compares favorably to 
alternatives. In a real world stock dataset, it also discovers meaningful groupings of variables, 
and produces a model that fits the data better. 

1 Introduction 

Discovering the latent structure from many observed variables is an important yet challenging 
learning task. The discovered structures can help us better understand the domain and lead to
potentially better predictive models. Many local search heuristics based on maximum parsimony and
maximum likelihood methods have been proposed to address this problem (Semple & Steel, 2003;
Zhang, 2004; Heller & Ghahramani, 2005; Teh et al., 2008; Harmeling & Williams, 2010). Their
common drawback is that it is difficult to provide consistency guarantees. Furthermore, the number
of hidden states often needs to be determined before the structure learning, or cross-validation
is needed to determine it, which can be very time consuming to run.

Efficient algorithms with provable performance guarantees have been explored in the phylogenetic
tree reconstruction community. One popular algorithm is the neighbor-joining (NJ) algorithm
(Saitou & Nei, 1987), where pairs of variables are joined recursively according to a certain
distance measure. The NJ algorithm is consistent when the distance measure satisfies the path
additive property (Mihaescu et al., 2009). For discrete random variables, the additive distance is
defined using the determinant of the joint probability table of a pair of variables (Lake, 1994).
However, this definition only applies to the cases where the observed variables and latent variables
have the same number of states. When the latent variables represent simpler factors with a smaller
number of states, the NJ algorithm can perform poorly.

Another family of provably consistent reconstruction methods is the quartet-based methods
(Semple & Steel, 2003; Erdős et al., 1999). These methods first resolve a set of latent relations
for quadruples of observed variables (quartets), and subsequently stitch them together to form a
latent tree. A good quartet test plays an essential role in these methods, as it is called repeatedly
by the stitching algorithms. Recently, Anandkumar et al. (2011) proposed a quartet test using
the leading k singular values of the joint probability table, where k is the number of hidden states.
This new approach allows k to be different from the number of the observed states. However, it
still requires k to be given in advance.

Our goal is to design a latent structure discovery algorithm which is agnostic to the number 
of hidden states, since in practice we rarely know this number. The proposed approach is quartet 
based, where the quartet relations are resolved based on rank properties of 4th order tensors 
associated with the joint probability tables of quartets. The key insight is that rank properties of 
the tensor reveal the latent structure behind a quartet. Similar observations have been reported
in the phylogenetic community (Eriksson, 2005; Allman & Rhodes, 2006), but they are concerned
with the cases where the number of hidden states is larger than or equal to the number of observed
states. We focus instead on the cases where the number of hidden states is smaller, representing
simpler factors. Furthermore, if the joint probability tensor is only approximately given (due
to sampling noise) the main rank condition has to be modified. In Allman & Rhodes (2006) such a
condition is missing, and in Eriksson (2005) the condition is heuristically translated to the distance of
a matrix to its best rank-k approximation. In contrast, we propose a novel nuclear norm relaxation
of the rank condition, discuss its advantages, and provide recovery conditions and finite sample 
guarantees. Our quartet test is easy to compute since it only involves singular value decomposition 
of unfolded 4th order tensors. 

Using the proposed quartet test as a subroutine, the latent tree structure can be recovered in
a divide-and-conquer fashion (Pearl & Tarsi, 1986). For d observed variables, the computational
complexity of the algorithm is $O(d \log d)$, making it scalable to large problems. Under mild
conditions, the tree construction algorithm using our quartet test is consistent and stable to estimate
given a finite number of samples. In simulations, we compared to alternatives in terms of resolving
quartet relations and building the entire latent trees. The proposed approach is among the best
performing ones while being agnostic to the number of hidden states k. The latter is an important
improvement, since cross validation for finding k is expensive while leading to similar final results.
We also applied the new approach to a stock dataset, where it discovered meaningful groupings of
stocks according to industrial sectors, and led to a latent variable model that fits the data better than
the competitors.



2 Latent Tree Graphical Models 

In this paper, we focus on discrete latent variable models where the conditional independence
structures are specified by trees. We assume that the $d$ observed variables, $\mathcal{O} = \{X_1, \ldots, X_d\}$,
are leaves of the tree and that they all have the same number of states, $n$. We also assume the
$d_h$ hidden variables, $\mathcal{H} = \{X_{d+1}, \ldots, X_{d+d_h}\}$, have the same, but unknown, number of states, $k$
($k < n$).¹ Furthermore, we use uppercase letters to denote random variables (e.g., $X_i$) and lowercase
letters their instantiations (e.g., $x_i$).

Factorization of distribution. The joint distribution of all variables, $\mathcal{X} = \mathcal{O} \cup \mathcal{H}$, in a
latent tree model is a multi-way table (tensor), $\mathcal{P}$, with $d + d_h$ dimensions. Although the tensor
has $O(n^d k^{d_h})$ entries, they can be computed from just a polynomial number of parameters
due to the latent tree structure. That is, $\mathcal{P}(x_1, \ldots, x_{d+d_h}) = \prod_{i=1}^{d+d_h} P(x_i \mid x_{\pi_i})$, where each
$P(X_i \mid X_{\pi_i})$ is a conditional probability table (CPT) of a variable $X_i$ and its parent $X_{\pi_i}$ in the tree.²
This factorization leads to a significant saving in terms of tensor representation: we can represent
an exponential number of entries using just $O(d_h k^2 + d n k)$ parameters from the CPTs. Throughout
the paper, we assume that (A1) all CPTs have full column rank, $k$.

¹Our results are easily generalizable to the case where all hidden variables have different numbers of states.

Structure learning. Determining the tree topology T is an important and challenging learning
problem. The goal is to discover the latent structure based just on samples from observed variables.
For simplicity and uniqueness of the tree topology (Pearl, 1988), we assume that (A2) every latent
variable has exactly 3 neighbors.

Figure 1: Quartet $(X_1, X_2, X_3, X_4)$ from a tree.

Figure 2: Three fixed ways to connect $X_1, X_2, X_3, X_4$ with two latent variables H and G:
{{1,2},{3,4}}, {{1,3},{2,4}} and {{1,4},{2,3}}.

Quartet. A quadruple of observed variables from a latent tree T is called a quartet (Figure 1).
Under assumption (A2), there are 3 ways to connect a quartet, $X_1, X_2, X_3, X_4$, using 2 latent
variables H and G (Figure 2). However, only one of the 3 quartet relations is consistent with T. The
mapping between quartets and the tree topology T is captured in the following theorem (Buneman,
1971):



Theorem 1. The set of all quartet relations $\mathcal{Q}_T$ is unique to a latent tree T, and furthermore, T
can be recovered from $\mathcal{Q}_T$ in polynomial time.

Quartet-based tree reconstruction. Motivated by Theorem 1, a family of latent tree
recovery algorithms has been designed based on resolving quartet relations. These algorithms first
determine which of the 3 ways 4 variables are connected, and then join together all quartet
relations to form a consistent latent tree. For a model with d observed variables, there are $O(d^4)$
quartet relations in total (taking all possible combinations of 4 variables). However, we do not
necessarily need to resolve all these quartet relations in order to reconstruct the latent tree. A
small set of size $O(d \log d)$ will suffice for the tree recovery, which makes quartet based methods
efficient even for problems with large d (Pearl & Tarsi, 1986; Pearl, 1988). In this paper, we design
a new quartet based method. Our main contribution compared to previous approaches is that our
method is agnostic to the number of hidden states, k, which is usually unknown in practice.
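To get a feel for the gap between the $O(d^4)$ and $O(d \log d)$ counts, the following Python sketch compares the number of all quartets with a rough count of the tests an incremental scheme might use. The per-insertion cost `ceil(log2(i))` and both function names are illustrative assumptions; they are not constants taken from Pearl & Tarsi (1986).

```python
import math

def all_quartets(d):
    """Number of distinct quadruples among d observed variables."""
    return math.comb(d, 4)

def incremental_tests(d):
    """Rough count for an incremental scheme (assumed constants):
    one test for the initial quartet, then about log2(i) tests
    to insert leaf i+1 into a tree that already has i leaves."""
    return 1 + sum(math.ceil(math.log2(i)) for i in range(4, d))

for d in (16, 59, 1000):
    print(d, all_quartets(d), incremental_tests(d))
```

Even with generous constants, the incremental count grows far more slowly than the number of all quadruples.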



²For a latent tree, we can select a latent node as the root, and re-orient all edges away from it to induce consistent
parent-child relations. For the root node $X_r$, $P(X_r \mid X_{\pi_r}) = P(X_r)$.
3 Resolving Quartet Relations without Knowing the Number of Hidden States



In this section, we develop a test for resolving the latent relation of a quartet when the number of
hidden states is unknown. Our approach makes use of information from the joint probability table
of a quartet, which is a 4-way table or 4th order tensor. Suppose that the quartet relation of 4
variables, $X_1, X_2, X_3$ and $X_4$, is {{1, 2}, {3, 4}}; then the entries in this tensor are specified by

$$\mathcal{P}(x_1, x_2, x_3, x_4) = \sum_{h,g} P(x_1 \mid h)\, P(x_2 \mid h)\, P(h, g)\, P(x_3 \mid g)\, P(x_4 \mid g). \quad (1)$$

This factorization suggests that there exist some low rank structures in the 4th order tensor. To
study the rank properties of $\mathcal{P}(X_1, X_2, X_3, X_4)$, we first relate it to the conditional probability
tables, $P(X_1 \mid H)$, $P(X_2 \mid H)$, $P(X_3 \mid G)$, $P(X_4 \mid G)$, and the joint probability table, $P(H, G)$ (we
abbreviate them as $P_{1|H}$, $P_{2|H}$, $P_{3|G}$, $P_{4|G}$ and $P_{HG}$, respectively). Using tensor algebra, we have

$$\mathcal{P}(X_1, X_2, X_3, X_4) = \langle \mathcal{T}_1, \mathcal{T}_2 \rangle_3,$$
$$\text{with } \mathcal{T}_1 = \mathcal{I}_H \times_1 P_{1|H} \times_2 P_{2|H},$$
$$\mathcal{T}_2 = \mathcal{I}_G \times_1 P_{3|G} \times_2 P_{4|G} \times_3 P_{HG},$$

where $\mathcal{I}_H$ and $\mathcal{I}_G$ are 3rd order diagonal tensors of size $k \times k \times k$ with diagonal elements equal to
1. The multiplication $\times_i$ denotes a tensor-matrix multiplication with respect to the i-th dimension
of the tensor and the rows of the matrix, and $\langle \cdot, \cdot \rangle_3$ denotes tensor-tensor multiplication along the
third dimension of both tensors.³ This formula can be schematically understood as in Figure 3.

Figure 3: Schematic diagram of the tensor $\mathcal{P}(X_1, X_2, X_3, X_4)$.

We will start by characterizing the rank properties of $\mathcal{P}$ and then exploit them to design a quartet test.
Although the proposed approach involves unfolding the tensor and subsequent computation at the
matrix level, modeling the problem using tensors provides higher level conceptual understanding
of the structure of $\mathcal{P}$. The novelty of our use of low rank tensors is for latent structure discovery.
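As a concrete illustration of equation (1), the sketch below builds the 4th order tensor of a quartet with relation {{1, 2}, {3, 4}} from its CPTs by summing out h and g. It assumes NumPy; the helper names `random_cpt` and `quartet_tensor` and the random parameters are hypothetical and serve only as an example.

```python
import numpy as np

def random_cpt(n_rows, n_cols, rng):
    """Column-stochastic matrix: column j is a distribution P(. | parent = j)."""
    m = rng.random((n_rows, n_cols))
    return m / m.sum(axis=0, keepdims=True)

def quartet_tensor(P1H, P2H, P3G, P4G, PHG):
    """Joint table P(x1, x2, x3, x4) of equation (1), summing over h and g."""
    return np.einsum("ah,bh,hg,cg,dg->abcd", P1H, P2H, PHG, P3G, P4G)

rng = np.random.default_rng(0)
n, k = 5, 2                       # observed and hidden state counts (example values)
P1H, P2H, P3G, P4G = (random_cpt(n, k, rng) for _ in range(4))
PHG = rng.random((k, k))
PHG /= PHG.sum()                  # joint table of (H, G)

P = quartet_tensor(P1H, P2H, P3G, P4G, PHG)
print(P.shape, P.sum())
```

The resulting array has $n^4$ entries but is determined by only $4nk + k^2$ numbers, which is the low rank structure the section goes on to exploit.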

3.1 Unfolding the 4th Order Tensor 

Now we consider 3 different reshapings A, B and C of the tensor $\mathcal{P}$ into matrices ("unfoldings").
These unfoldings contain exactly the same entries as $\mathcal{P}$ but in different order. A corresponds to the
grouping {{1, 2}, {3, 4}} of the variables, i.e., the rows of A correspond to dimensions 1 and 2 of $\mathcal{P}$,
and its columns to dimensions 3 and 4. B corresponds to the grouping {{1, 3}, {2, 4}} and C to
the grouping {{1, 4}, {2, 3}}. Using Matlab's notation (see appendix for further explanation),

A = reshape(P, n^2, n^2); (2)

B = reshape(permute(P, [1, 3, 2, 4]), n^2, n^2); (3)

C = reshape(permute(P, [1, 4, 2, 3]), n^2, n^2). (4)

³For formal definitions of tensor notations see appendix.
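For readers working outside Matlab, the three unfoldings can be sketched in NumPy as follows. This is an assumed translation rather than code from the paper: `order="F"` mimics Matlab's column-major `reshape`, and Matlab's 1-based permutation vectors become 0-based axis tuples.

```python
import numpy as np

def unfoldings(P):
    """Return (A, B, C) for groupings {{1,2},{3,4}}, {{1,3},{2,4}}, {{1,4},{2,3}}.

    P is an n x n x n x n array. Matlab's permute(P, [1,3,2,4]) corresponds
    to P.transpose(0, 2, 1, 3); order="F" reproduces column-major reshape.
    """
    n = P.shape[0]
    A = P.reshape(n * n, n * n, order="F")
    B = P.transpose(0, 2, 1, 3).reshape(n * n, n * n, order="F")
    C = P.transpose(0, 3, 1, 2).reshape(n * n, n * n, order="F")
    return A, B, C
```

With this convention (0-based indices), entry `P[x1, x2, x3, x4]` sits at `A[x1 + n*x2, x3 + n*x4]`, `B[x1 + n*x3, x2 + n*x4]` and `C[x1 + n*x4, x2 + n*x3]`.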

Next we present useful characterizations of A, B and C, which will be essential for understanding
their connection with the latent structure of a quartet. The Kronecker product of two matrices M
and M' is denoted as $M \otimes M'$, and if they have the same number of columns, their Khatri-Rao
product (column-wise Kronecker product) is denoted as $M \odot M'$. Then (see appendix §9 for proof):

Lemma 2. Assume that {{1, 2}, {3, 4}} is the correct latent structure. The matrices A, B and C
can be factorized respectively as (see Figure 4(a) and Figure 4(b) for schematic diagrams)

$$A = (P_{2|H} \odot P_{1|H})\; P_{HG}\; (P_{4|G} \odot P_{3|G})^\top, \quad (5)$$

$$B = (P_{3|G} \otimes P_{1|H})\; \mathrm{diag}(P_{HG}(:))\; (P_{4|G} \otimes P_{2|H})^\top, \quad (6)$$

$$C = (P_{4|G} \otimes P_{1|H})\; \mathrm{diag}(P_{HG}(:))\; (P_{3|G} \otimes P_{2|H})^\top. \quad (7)$$

Figure 4: Schematic diagrams of the two unfoldings A and B. (a) A. (b) B.

The factorization of A is very different from those of B and C. First, in A, $P_{2|H} \odot P_{1|H}$ is a
matrix of size $n^2 \times k$, and the columns of $P_{2|H}$ interact only with their corresponding columns in
$P_{1|H}$. However, in B, $P_{3|G} \otimes P_{1|H}$ is a matrix of size $n^2 \times k^2$, and every column of $P_{1|H}$ interacts
with every column of $P_{3|G}$ (similarly for C). Second, in A, the middle factor $P_{HG}$ has
size $k \times k$, whereas in B, the entries of $P_{HG}$ appear as the diagonal of a matrix of size $k^2 \times k^2$
(similarly for C). These differences result in different rank properties of A, B and C which we will
exploit to discover the latent structure of a quartet.
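The factorizations (5)-(7) can be checked numerically on a small random model. The sketch below assumes NumPy, column-major unfoldings as in (2)-(4), and a hypothetical `khatri_rao` helper; the model parameters are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 4, 2

def random_cpt(rows, cols):
    """Column-stochastic matrix used as an illustrative CPT."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0, keepdims=True)

def khatri_rao(M, N):
    """Column-wise Kronecker product: column j equals kron(M[:, j], N[:, j])."""
    return np.einsum("ik,jk->ijk", M, N).reshape(M.shape[0] * N.shape[0], -1)

P1H, P2H, P3G, P4G = (random_cpt(n, k) for _ in range(4))
PHG = rng.random((k, k))
PHG /= PHG.sum()

# Joint tensor of equation (1) and its three column-major unfoldings.
P = np.einsum("ah,bh,hg,cg,dg->abcd", P1H, P2H, PHG, P3G, P4G)
A = P.reshape(n * n, n * n, order="F")
B = P.transpose(0, 2, 1, 3).reshape(n * n, n * n, order="F")
C = P.transpose(0, 3, 1, 2).reshape(n * n, n * n, order="F")

# Right-hand sides of (5), (6) and (7); P_HG(:) is the column-major vectorization.
D = np.diag(PHG.flatten(order="F"))
A_fact = khatri_rao(P2H, P1H) @ PHG @ khatri_rao(P4G, P3G).T
B_fact = np.kron(P3G, P1H) @ D @ np.kron(P4G, P2H).T
C_fact = np.kron(P4G, P1H) @ D @ np.kron(P3G, P2H).T

lemma2_holds = (np.allclose(A, A_fact) and np.allclose(B, B_fact)
                and np.allclose(C, C_fact))
print(lemma2_holds)
```

Note how the inner dimension is k for A but $k^2$ for B and C, which is the source of the rank gap discussed next.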



3.2 Rank Properties of the Unfoldings 

Under assumption (A1) that all CPTs have full column rank, the factorizations of A, B and C
in (5), (6) and (7) respectively suggest that (see appendix for more details)

$$\mathrm{rank}(A) = \mathrm{rank}(P_{HG}) = k \le \mathrm{rank}(B) = \mathrm{rank}(C) = \mathrm{nnz}(P_{HG}), \quad (8)$$

where nnz(·) denotes the number of nonzero elements. We note that the equality is attained if and
only if the relationship between the hidden variables G and H is deterministic, i.e., there is a single
nonzero element in each row and in each column of $P_{HG}$. In this case, the grouping of variables in
a quartet can be arbitrary, and we will not consider this case in the paper. More specifically, we
have

Theorem 3. Assume $P_{HG}$ has only a few zero entries; then $k \ll k^2 \approx \mathrm{nnz}(P_{HG})$ and thus

$$\mathrm{rank}(A) \ll \mathrm{rank}(B) = \mathrm{rank}(C). \quad (9)$$
The above theorem reveals a useful difference between the correct grouping of variables and the
two incorrect ones. Furthermore, this condition can be easily verified: given $\mathcal{P}$ we can check the
rank of its matrix representations A, B and C and thus discover the latent structure of the quartet.
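The rank relation (8) can be observed on exact (noise-free) tensors. The following sketch, which assumes NumPy and uses a hypothetical helper `unfolding_ranks` with random illustrative CPTs, contrasts a dense $P_{HG}$ with a deterministic one.

```python
import numpy as np

def unfolding_ranks(cpts, PHG):
    """Numerical ranks of the unfoldings A, B, C of the exact quartet tensor."""
    P1H, P2H, P3G, P4G = cpts
    n = P1H.shape[0]
    P = np.einsum("ah,bh,hg,cg,dg->abcd", P1H, P2H, PHG, P3G, P4G)
    # Row-major reshapes permute rows/columns relative to (2)-(4); ranks are unchanged.
    A = P.reshape(n * n, n * n)
    B = P.transpose(0, 2, 1, 3).reshape(n * n, n * n)
    C = P.transpose(0, 3, 1, 2).reshape(n * n, n * n)
    return tuple(int(np.linalg.matrix_rank(M)) for M in (A, B, C))

rng = np.random.default_rng(0)
n, k = 5, 2
cpts = []
for _ in range(4):
    m = rng.random((n, k))
    cpts.append(m / m.sum(axis=0, keepdims=True))   # full column rank with probability 1

dense = rng.random((k, k))
dense /= dense.sum()              # nnz = k^2
deterministic = np.eye(k) / k     # H determines G, nnz = k

ranks_dense = unfolding_ranks(cpts, dense)
ranks_det = unfolding_ranks(cpts, deterministic)
print(ranks_dense, ranks_det)
```

For the dense case the ranks follow the pattern $(k, k^2, k^2)$, while in the deterministic case all three unfoldings have rank k, so the groupings cannot be told apart by rank.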

3.3 Nuclear Norm Relaxation for the Rank Condition 

In practice, due to sampling noise all unfolding matrices A, B and C would be nearly full rank,
so the rank condition cannot be applied directly. To deal with this, we design a test based on a
relaxation of the rank condition using the nuclear norm

$$\|M\|_* = \sum_{i=1}^{n} \sigma_i(M), \quad (10)$$

which is the sum of all singular values of an (n x n) matrix M. Instead of comparing the ranks of
A, B and C, we look for the one with the smallest nuclear norm and declare the latent structure
corresponding to it. This simple quartet algorithm is summarized in Algorithm 1.

Algorithm 1 i* = Quartet($X_1, X_2, X_3, X_4$)

1: Estimate $\widehat{\mathcal{P}}(X_1, X_2, X_3, X_4)$ from a set of m i.i.d. samples $\{(x_1^l, x_2^l, x_3^l, x_4^l)\}_{l=1}^{m}$.

2: Unfold $\widehat{\mathcal{P}}$ in three different ways into matrices $\widehat{A}$, $\widehat{B}$ and $\widehat{C}$, and compute their nuclear norms
$a_1 = \|\widehat{A}\|_*$, $a_2 = \|\widehat{B}\|_*$ and $a_3 = \|\widehat{C}\|_*$.

3: Return $i^* = \arg\min_{i \in \{1,2,3\}} a_i$.

Note that Algorithm 1 works even if the number of hidden states, k, is a priori unknown. This is an important
advantage over the idea of learning the structure based on additive distance (Lake, 1994), where
k is assumed to be the same as the number of states, n, of the observed variables, or over a
recent approach based on a quartet test (Anandkumar et al., 2011), where k needs to be specified in
advance.
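The three steps of the nuclear norm quartet test can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: samples are assumed to be integer-coded in `0..n-1`, the function name `quartet_test` is hypothetical, and the synthetic data generator below is only an example model with relation {{1, 2}, {3, 4}}.

```python
import numpy as np

def quartet_test(samples, n):
    """Nuclear norm quartet test on an (m x 4) integer array with values in 0..n-1.

    Returns 1, 2 or 3 for {{1,2},{3,4}}, {{1,3},{2,4}} or {{1,4},{2,3}}.
    """
    samples = np.asarray(samples)
    P_hat = np.zeros((n, n, n, n))
    np.add.at(P_hat, tuple(samples.T), 1.0)      # step 1: empirical counts
    P_hat /= len(samples)
    # Step 2: unfoldings. Row-major reshapes only permute rows and columns
    # relative to the Matlab convention, so singular values are unchanged.
    unfolded = [
        P_hat.reshape(n * n, n * n),
        P_hat.transpose(0, 2, 1, 3).reshape(n * n, n * n),
        P_hat.transpose(0, 3, 1, 2).reshape(n * n, n * n),
    ]
    norms = [np.linalg.norm(M, ord="nuc") for M in unfolded]
    return int(np.argmin(norms)) + 1             # step 3

# Synthetic example (assumed model): binary H and G, G copies H with prob. 0.3,
# and each observed variable flips its hidden parent with probability 0.1.
rng = np.random.default_rng(0)
m = 5000
h = rng.integers(0, 2, m)
g = np.where(rng.random(m) < 0.3, h, rng.integers(0, 2, m))

def noisy_copy(z):
    return np.where(rng.random(m) < 0.1, 1 - z, z)

samples = np.column_stack([noisy_copy(h), noisy_copy(h), noisy_copy(g), noisy_copy(g)])
print(quartet_test(samples, 2))
```

Note that the number of hidden states never enters the function; only the number of observed states n is needed to size the empirical tensor.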

In our current context, the nuclear norm has a few useful properties. First, it is the tightest
convex lower bound of the rank of a matrix (Fazel et al., 2001). This is why it is meaningful to
compare nuclear norms instead of ranks. Second, it is easy to compute: a standard singular value
decomposition will do the job. Third, it is robust to estimate. The nuclear norm of a probability
matrix A based on samples is nicely concentrated around its population quantity (Rosasco et al.,
2010). Given a confidence level $1 - 2e^{-\tau}$, an estimate $\widehat{A}$ based on m samples satisfies

$$\left| \|\widehat{A}\|_* - \|A\|_* \right| \le \frac{2\sqrt{2\tau}}{\sqrt{m}}. \quad (11)$$

Fourth, the nuclear norm can be viewed as a measure of dependence between two pairs of variables.
For instance, if A corresponds to grouping {{1, 2}, {3, 4}}, $\|A\|_*$ measures the dependence between
the compound variables $\{X_1, X_2\}$ and $\{X_3, X_4\}$. In the community of kernel methods, A is treated
as a cross-covariance operator between $\{X_1, X_2\}$ and $\{X_3, X_4\}$, and its spectrum has been used
to design various dependence measures, such as the Hilbert-Schmidt Independence Criterion, which is
the sum of squares of all singular values (Gretton et al., 2005a), and the kernel constrained
covariance, which only takes the largest singular value (Gretton et al., 2005b). Intuitively, our quartet test
says that: if we group the variables correctly, then cross group dependence should be low, since the
groups are separated by two latent variables; however if we group the variables incorrectly, then
cross group dependence should be high, since similar variables exist in the two groups.

⁴Note that A, B and C consist of the same elements so their Frobenius norms are the same, i.e., the 3 matrices
are already equally "normalized".



4 Recovery Conditions and Finite Sample Guarantee for Quartets 

Since the nuclear norm is just a convex lower bound of the rank, there might be situations where
the nuclear norm does not satisfy the same relation as the rank. That is, it might happen that
rank(A) < rank(B) but $\|A\|_* > \|B\|_*$. In this section, we present sufficient conditions under which
the nuclear norm yields a successful quartet test.

When latent variables H and G are independent, $\mathrm{rank}(P_{HG}) = 1$, since $P_{HG} = P_H P_G^\top$
($P(h, g) = P(h) P(g)$). Let {{1, 2}, {3, 4}} be the correct quartet relation. We can obtain simpler
characterizations of the 3 unfoldings of $\mathcal{P}(X_1, X_2, X_3, X_4)$, denoted as $A_\perp$, $B_\perp$ and $C_\perp$ respectively.
Using Lemma 2 and the independence of H and G, we have (see appendix)

$$A_\perp = (P_{2|H} \odot P_{1|H})\, P_H P_G^\top\, (P_{4|G} \odot P_{3|G})^\top = P_{12}(:)\, P_{34}(:)^\top,$$
$$B_\perp = (P_{3|G} \otimes P_{1|H})\, (\mathrm{diag}(P_G) \otimes \mathrm{diag}(P_H))\, (P_{4|G} \otimes P_{2|H})^\top = P_{34} \otimes P_{12}, \quad (12)$$

and $\mathrm{rank}(A_\perp) = 1 \ll \mathrm{rank}(B_\perp)$, which is consistent with Theorem 3. Furthermore, since $A_\perp$ has
only one nonzero singular value, we have $\|A_\perp\|_* = \|A_\perp\|_F = \|B_\perp\|_F \le \|B_\perp\|_*$ (using $\|M\|_F \le \|M\|_*$
for any matrix M). Similarly, $C_\perp = P_{43} \otimes P_{12}$ and $\|A_\perp\|_* \le \|C_\perp\|_*$. Then we know for sure that
the nuclear norm quartet test will return the correct topology.
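The inequality $\|M\|_F \le \|M\|_*$ used in the independent case follows directly from the singular values $\sigma_1, \sigma_2, \ldots \ge 0$ of M; the short derivation below spells out this step and the resulting chain of norms.

```latex
\begin{align*}
\|M\|_F^2 &= \sum_i \sigma_i^2
  \;\le\; \sum_i \sigma_i^2 + 2\sum_{i<j} \sigma_i \sigma_j
  \;=\; \Big(\sum_i \sigma_i\Big)^2 = \|M\|_*^2, \\
\text{with equality} &\iff \sigma_i \sigma_j = 0 \ \text{for all } i \ne j
  \iff \operatorname{rank}(M) \le 1. \\[4pt]
\|A_\perp\|_* &= \|A_\perp\|_F
  && \text{(rank one: a single nonzero singular value)} \\
  &= \|B_\perp\|_F
  && \text{(same entries, different arrangement)} \\
  &\le \|B_\perp\|_*
  && \text{(strict whenever } \operatorname{rank}(B_\perp) > 1\text{)}.
\end{align*}
```

So in the independent case the correct unfolding has the smallest nuclear norm, and strictly so as soon as $B_\perp$ and $C_\perp$ have rank larger than one.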

When latent variables H and G are not independent, we treat it as a perturbation $\Delta$
away from the independent case, i.e., $P_{HG} = P_H P_G^\top + \Delta$. The size of $\Delta$ quantifies the strength
of dependence between H and G. Obviously, when $\Delta$ is small, e.g., $\Delta = 0$, we are back to the
independence case and it is easy to discover the correct quartet relation; when it is large, e.g., $\Delta =
I - P_H P_G^\top$, H and G are deterministically related and the different groupings are indistinguishable.
The question is how large $\Delta$ can be while still allowing the nuclear norm quartet test to find the
correct latent relation.

First, we require (A3) $\Delta \mathbf{1} = \mathbf{0}$ and $\Delta^\top \mathbf{1} = \mathbf{0}$, where $\mathbf{1}$ and $\mathbf{0}$ are vectors of all ones and all
zeros. Such a perturbation $\Delta$ keeps the marginal distributions $P_H$ and $P_G$ as in the independent
case, since $P_{HG} \mathbf{1} = P_H P_G^\top \mathbf{1} + \Delta \mathbf{1} = P_H$. Assuming {{1, 2}, {3, 4}} is the correct quartet
relation, $\Delta$ also keeps the pairwise marginal distribution $P_{12}$ as in the independent case, since
$P_{12} = P_{1|H}\, \mathrm{diag}(P_H)\, P_{2|H}^\top$ and the marginal $P_H$ is the same before and after the perturbation.
Similar reasoning also applies to $P_{34} = P_{3|G}\, \mathrm{diag}(P_G)\, P_{4|G}^\top$.

We define the excessive dependence of the correct and incorrect groupings as

$$\theta := \min\{\|B_\perp\|_* - \|A_\perp\|_*,\; \|C_\perp\|_* - \|A_\perp\|_*\}.$$

It quantifies the changes in dependence when we switch from incorrect groupings to the correct
one (in the case when H and G are independent). Note that, by (12), $\theta$ is measured only from the
pairwise marginals $P_{12}$ and $P_{34}$. Using matrix perturbation analysis we can show that (see appendix
§11 for proof)
Lemma 4. If $\|\Delta\|_F \le \frac{\theta}{k^2 + k}$, then Algorithm 1 returns the correct quartet relation.

Thus, if the excessive dependence $\theta$ is large compared to the number of hidden states, the size
of the allowable perturbation can be correspondingly larger. In other words, if the dependence
between variables within the same group is strong enough compared to the dependence across
groups, we allow for larger $\Delta$ and stronger dependence between hidden variables H and G (which
is closer to the indistinguishable case). Then under the recovery condition in Lemma 4 and given
m i.i.d. observations, we can obtain the following guarantee for the quartet test (see appendix §13
for proof). Let $\alpha = \min\{\|B\|_* - \|A\|_*,\; \|C\|_* - \|A\|_*\}$.

Lemma 5. With probability $1 - 8 e^{-\frac{1}{32} m \alpha^2}$, Algorithm 1 returns the correct quartet relation.
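To see what such an exponential guarantee means in practice, one can invert it for the sample size. The sketch below reads the failure probability in Lemma 5 as $8 e^{-m\alpha^2/32}$, an assumption that matches the exponent appearing in Theorem 7; the function name and the example values of $\alpha$ and $\delta$ are hypothetical.

```python
import math

def samples_needed(alpha, delta):
    """Smallest m with 8 * exp(-m * alpha**2 / 32) <= delta.

    Assumes the failure probability has the form 8 * exp(-m * alpha^2 / 32).
    """
    return math.ceil(32.0 * math.log(8.0 / delta) / alpha ** 2)

def failure_bound(m, alpha):
    """Upper bound on the probability that the quartet test errs (assumed form)."""
    return 8.0 * math.exp(-m * alpha ** 2 / 32.0)

m = samples_needed(alpha=0.2, delta=0.05)
print(m, failure_bound(m, 0.2))
```

Because m scales as $1/\alpha^2$, halving the nuclear norm margin $\alpha$ roughly quadruples the number of samples required for the same confidence.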



5 Building Latent Tree from Quartets 



Algorithm. We can use the resolved quartet relations (Algorithm 1) to discover the structure of
the entire tree via an incremental divide-and-conquer algorithm (Pearl & Tarsi, 1986; Pearl, 1988),
summarized in Algorithm 2 (further details in appendix §10). Joining variable $X_{i+1}$ to the current
tree of i leaves can be done with $O(\log i)$ tests. This amounts to performing $O(d \log d)$ quartet
tests for building an entire tree of d leaves, which is efficient even if d is large. Moreover, as shown
in (Pearl & Tarsi, 1986), this algorithm is consistent.



Algorithm 2 T = BuildTree($X_1, \ldots, X_d$)

1: Connect any 4 variables $X_1, X_2, X_3, X_4$ with 2 latent variables in a tree T using Algorithm 1.
2: for i = 4, 5, ..., d - 1 do {insert the (i+1)-th leaf $X_{i+1}$}
3:   Choose root R that splits T into sub-trees $T_1, T_2, T_3$ of roughly equal size.
4:   Choose any triplet $(X_{i_1}, X_{i_2}, X_{i_3})$ of leaves from different sub-trees.
5:   Test which sub-tree $X_{i+1}$ should be joined to:
       $i^* \leftarrow$ Quartet($X_{i+1}, X_{i_1}, X_{i_2}, X_{i_3}$).
6:   Repeat recursively from step 3 with $T := T_{i^*}$.
     This will eventually reduce to a tree with a single leaf. Join $X_{i+1}$ to it via a hidden variable.
7: end for



Tree recovery conditions and guarantees. How will the quartet recovery conditions trans- 
late to recovery conditions for the entire tree, where each "edge" of a quartet is a path in the tree? 
What are the finite sample guarantees for the divide-and-conquer algorithm? 

When a quartet is taken from a latent tree, each edge of the quartet corresponds to a path
in the tree involving a chain of variables (Figure 2). We need to bound the perturbation to each
single edge of the tree such that joint path perturbations satisfy the edge perturbation conditions from
Lemma 4. For a quartet $q = \{\{i_1, i_2\}, \{i_3, i_4\}\}$ corresponding to a single edge between H and
G, denote the excessive dependence by $\theta_q$. By adding a perturbation of size smaller than $\frac{\theta_q}{k^2 + k}$
to $P_H P_G^\top$ we can still correctly recover q. Let $\theta_{\min} := \min_{\text{quartet } q} \theta_q$. If we require $\|\Delta_q\|_F \le \frac{\theta_{\min}}{k^2 + k}$,
all such quartet relations will be recovered successfully. If we further restrict the size of
the perturbation by the smallest value in a marginal probability distribution of a hidden variable,
$\gamma_{\min} := \min_{\text{hidden node } H} \min_{i = 1, \ldots, k} P_H(i)$, we can guarantee that all quartet relations corresponding
to a path between H and G can also be successfully recovered by the nuclear norm test (see appendix
§12). Therefore, we assume that (A4) $\|\Delta_q\|_F \le \min\{\frac{\theta_{\min}}{k^2 + k},\, \gamma_{\min}\}$ for all quartets q in a tree.

Theorem 6. Algorithm 2 returns the correct tree topology under assumptions (A1)-(A4).

The recovery conditions guarantee that all quartet relations can be resolved correctly and
simultaneously. Then a consistent algorithm using a subset of the quartet relations should return
the correct tree structure. Given m i.i.d. samples, we have the following statistical guarantee for
the tree building algorithm (see appendix for proof). Let $\alpha_{\min} := \min_{\text{quartet } q} \alpha_q$.

Theorem 7. With probability $1 - 8 \cdot c \cdot d \log d \cdot e^{-\frac{1}{32} m \alpha_{\min}^2}$, Algorithm 2 recovers the correct tree
topology for a constant c under assumptions (A1)-(A4).

We note that there are better quartet based algorithms for building latent trees with stronger
statistical guarantees, e.g. (Erdős et al., 1999). We can adapt our nuclear norm based quartet test
to those algorithms as well. However, this is not the main focus of the paper. We choose the
divide-and-conquer algorithm due to its simplicity and ease of analysis, and because it illustrates well
how our quartet recovery guarantee can be translated into a tree building guarantee.



6 Experiments 



We compared our algorithm with representative algorithms: the neighbor-joining algorithm (NJ)
(Saitou & Nei, 1987), a quartet based algorithm of Anandkumar et al. (2011) (Spectral@k), the
Chow-Liu neighbor joining algorithm (CLNJ) (Choi et al., 2011), and an algorithm of Harmeling & Williams
(2010) (HW).



NJ proceeds by recursively joining two variables that are closest according to an additive
distance defined as $d_{ij} = \frac{1}{2} \log\det\mathrm{diag}\,P_i - \log|\det P_{ij}| + \frac{1}{2} \log\det\mathrm{diag}\,P_j$, where "det" denotes
determinant, "diag" is a diagonalization operator, $P_{ij}$ denotes the joint probability table $P(X_i, X_j)$,
and $P_i$ and $P_j$ the probability vectors $P(X_i)$ and $P(X_j)$ respectively (Lake, 1994). When $P_{ij}$ has
rank k < n, $\log|\det P_{ij}|$ is not defined, and NJ can perform poorly. Spectral@k uses singular values
of $P_{ij}$ to design a quartet test (Anandkumar et al., 2011). For instance, if the true quartet
configuration is {{1, 2}, {3, 4}} as in Figure 1, then the quartet needs to satisfy
$\prod_{s=1}^{k} \sigma_s(P_{12})\, \sigma_s(P_{34}) > \max\{\prod_{s=1}^{k} \sigma_s(P_{13})\, \sigma_s(P_{24}),\; \prod_{s=1}^{k} \sigma_s(P_{14})\, \sigma_s(P_{23})\}$.
Based on this relation, a confidence interval
based quartet test is designed and used as a subroutine for a tree reconstruction algorithm.
Spectral@k can handle cases with k < n, but still requires k as an input. We will show in later
experiments that its performance is sensitive to the choice of k. CLNJ first applies the Chow-Liu
algorithm (Chow & Liu, 1968) to obtain a fully observed tree and then proceeds by adding latent
variables using the neighbor joining algorithm. The HW algorithm is a greedy algorithm to learn
binary trees by iteratively joining two nodes with a high mutual information. The number of hidden
states is automatically determined in the HW algorithm and can be different for different latent
variables.
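The additive distance used by NJ can be written down directly from its definition. The sketch below assumes NumPy and a hypothetical function name `lake_distance`; it checks path additivity on a random three-variable chain $X_1 - X_2 - X_3$ in which all variables have the same number of states, which is the setting where the distance is well defined.

```python
import numpy as np

def lake_distance(P_ij):
    """d_ij = 0.5*log det diag(P_i) - log|det P_ij| + 0.5*log det diag(P_j)."""
    P_i = P_ij.sum(axis=1)                      # marginal of X_i
    P_j = P_ij.sum(axis=0)                      # marginal of X_j
    _, log_abs_det = np.linalg.slogdet(P_ij)    # -inf when P_ij is singular
    return 0.5 * np.log(P_i).sum() - log_abs_det + 0.5 * np.log(P_j).sum()

# Random Markov chain X1 - X2 - X3 with n states each (illustrative parameters).
rng = np.random.default_rng(0)
n = 3
P12 = rng.random((n, n))
P12 /= P12.sum()                                # joint table of (X1, X2)
T = rng.random((n, n))
T /= T.sum(axis=0, keepdims=True)               # T[x3, x2] = P(x3 | x2)
P2 = P12.sum(axis=0)
P23 = np.diag(P2) @ T.T                         # joint table of (X2, X3)
P13 = P12 @ T.T                                 # joint table of (X1, X3)

gap = lake_distance(P13) - (lake_distance(P12) + lake_distance(P23))
print(gap)
```

When $P_{ij}$ is rank deficient, as happens when the hidden variables have fewer states than the observed ones, the determinant vanishes and the distance becomes infinite or numerically meaningless, which is the failure mode described above.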



6.1 Resolving Quartet Relations 

Figure 5: (a)-(f) Quartet recovery results. (g)-(l) Tree recovery results. "tensor" is our method.

We compared our method to NJ and Spectral@k in terms of their ability to recover the quartet
relation among four variables. We used quartets with three different configurations for the hidden
states: (1) $k_H = 2$ and $k_G = 4$ (small difference); (2) $k_H = 2$, $k_G = 8$ (large difference); and (3)
$k_H = 4$, $k_G = 4$ (no difference). In all cases, the states of the observed variables were fixed to n = 10.
In all cases we started from independent $P_{HG}$ and identity $P_{X_i|H}$ and $P_{X_i|G}$, and perturbed each
conditional probability $P(a = i \mid b)$ with noise terms $u_i$, where all $u_i$ are i.i.d. random variables
drawn from Uniform$[0, \mu]$. We then drew random samples from the quartet according to these
CPTs. We studied the percentage of correctly recovered quartet relations as we varied the sample
size across S = {50, 100, 200, 300, 400, 500, 750, 1000, 1500, 2000} and under two different levels of
perturbation ($\mu \in \{0.5, 1\}$). We randomly initialized each experiment 1000 times and report the
average quartet recovery performance and the standard error in Figure 5.

The proposed method compares favorably to NJ and Spectral@k. The performance of Spectral@k
varies a lot depending on the chosen number of singular values k. Our method is free from tuning
parameters and often stays among the top performing ones. Especially when the numbers of hidden
states are very different from each other ($k_H = 2$ and $k_G = 8$), our method leads the second
best by a large gap (Figures 5(b) and 5(e)). When both hidden variables have the same number of
states ($k_H = k_G = 4$), Spectral@k achieves the best performance when the chosen number of singular
values k is the same as $k_H$. Note that allowing Spectral@k to use different k resembles using cross
validation for finding the best k. This is expensive, while our approach performs almost
indistinguishably from Spectral@k even when the latter chooses the best k.



6.2 Discovering Latent Tree Structure 

We used different tree topologies and sample sizes in this experiment. We generated tree topologies
by randomly splitting 16 observed variables recursively into two groups. The recursive splitting
stops when there are only two nodes left in a group. We introduced a hidden variable to join the
two partitions in each recursion and this gives a latent tree structure. The topology of the tree is
controlled by a single splitting parameter $\beta$ which controls the relative size of the first partition
versus the second. If $\beta$ is close to 0 or 1, we obtain trees of skewed shape, with long paths of hidden
variables. If $\beta$ is close to 0.5, the resulting latent trees are more balanced. In our experiments,
we experimented with skewed latent trees ($\beta = 0.2$) and balanced trees ($\beta = 0.5$). We first generated
different random k between 2 and 8 for the hidden states, and then generated the probability models
for each tree using the same scheme as in our previous experiment. Here we experimented with
perturbation levels $\mu \in \{0.2, 0.5, 1\}$.

We varied the sample size across S = {50, 100, 200, 500, 1000, 2000}, and measured the error of
the constructed tree using the Robinson-Foulds metric (Robinson & Foulds, 1981). This measure is a
metric over trees with the same number of leaves. It is defined as (a + b), where a is the number of
partitions of variables implied by the learned tree but not by the true tree and b is the number of
partitions of the variables implied by the true tree but not by the learned tree (in a sense similar
to precision and recall score).
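The (a + b) definition translates into a pair of set differences once each tree is represented by the leaf bipartitions induced by its internal edges. The sketch below is a minimal illustration; the canonical representation of a split and both function names are assumptions made for this example.

```python
def canonical_split(side, leaves):
    """Represent the bipartition (side | rest) by the half that does not
    contain the smallest leaf, so each split has exactly one representation."""
    side = frozenset(side)
    rest = frozenset(leaves) - side
    return rest if min(leaves) in side else side

def robinson_foulds(true_splits, learned_splits):
    """a + b: splits only in the learned tree plus splits only in the true tree."""
    a = len(learned_splits - true_splits)
    b = len(true_splits - learned_splits)
    return a + b

# Example on a quartet: the true relation is {{1,2},{3,4}}, the learned one {{1,3},{2,4}}.
leaves = {1, 2, 3, 4}
true_tree = {canonical_split({1, 2}, leaves)}
learned_tree = {canonical_split({1, 3}, leaves)}
print(robinson_foulds(true_tree, learned_tree))
```

Only the nontrivial splits (those separating at least two leaves from at least two others) need to be included, since the trivial single-leaf splits are shared by all trees on the same leaf set.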

The tree recovery results are shown in Figures 5(g)-5(l). Again we can see that our proposed
method compares favorably to existing algorithms. Across all 6 experimental conditions,
the tensor approach and Spectral@2 performed the best with sufficiently large sample sizes. Note
that we tried out different k for Spectral@k, which resembles using cross validation for finding the
best k. Even in this case, our approach works comparably without having to know k. The
Harmeling-Williams algorithm performed well for small sample sizes, while CLNJ does not perform well in
these experimental conditions.



6.3 Understanding Latent Relations between Stocks 

We applied our algorithm to discover a latent tree structure from a stock dataset. Our goal is to
understand how stock prices $X_i$ are related to each other. We acquired closing prices of 59 stocks
from 1984 to 2011 (from www.finance.yahoo.com), which provides us with 6800 samples. The daily
change of each stock price is discretized into 10 values, and we applied our algorithm to build a
latent tree. A visualization of the learned tree topology and discovered groupings is shown in
Figure 6.

We see nice groupings of stocks according to their industrial sectors. For instance, companies
related to petroleum, such as CVX (Chevron), XOM (Exxon Mobil), APA (Apache), COP (ConocoPhillips),
SLB (Schlumberger) and SUN (Sunoco), are grouped into a subtree. Pharmaceutical
companies, such as MRK (Merck), PFE (Pfizer), BMY (Bristol Myers Squibb), LLY (Eli Lilly),
ABT (Abbott Laboratories), JNJ (Johnson and Johnson) and BAX (Baxter International), are all
grouped into a subtree. High-tech companies, such as AMD, MOT (Motorola), HPQ (Hewlett-Packard)
and IBM, are grouped into another subtree. There are also subtrees for retailers, such as
TGT (Target), WMT (Wal-Mart) and RSH (RadioShack), for utility service companies, such
as DUK (Duke Energy), ED (Consolidated Edison), EIX (Edison), EXC (Exelon) and VZ (Verizon),
and for financial companies, such as C (Citigroup), JPM (JPMorgan Chase) and
AXP (American Express). An interesting observation is that
F (Ford Motor), which is well known for its car manufacturing, is also placed in the same branch
as these financial companies. This seemingly abnormal structure can be explained by the fact that
Ford Motor operates under two segments: Automotive and Financial Services. Its financial services
include the operations of Ford Motor Credit Company and other financial services, including holding
companies and real estate. In this respect, it is quite interesting that our algorithm discovered this
hidden information.

Figure 6: Latent tree estimated from stock data. Legend (sector groups): finance, aerospace, retail, utility, petroleum, metal & transport, technology, pharmaceutical and healthcare.

We also compared different algorithms in terms of held-out likelihood. We first randomized the
data 10 times, and each time used half for training and half for computing the held-out likelihood.
Then we estimated the latent binary tree structures using different algorithms. Finally, we fit latent
variable models to the discovered structures. The number of states of all hidden variables, k,
was the same within each latent variable model. We experimented with k = 2, 4, 6, 8, 10 to simulate
the process of using cross-validation to select the best k. The results are presented in Table 1.

Table 1: Negative log-likelihood ($\times 10^{?}$) on test data. The smaller the number, the better the method.

          Tensor   Spectral@k   Choi (CLNJ)   Neighbor-joining   Harmeling   Chow-Liu
k = 2     4.41     4.44         4.43          4.43
k = 4     4.30     4.35         4.33          4.33
k = 6     4.28     4.35         4.32          4.31               4.31        4.41
k = 8     4.28     4.35         4.32          4.31
k = 10    4.29     4.37         4.32          4.31

Note that the Harmeling-Williams algorithm automatically discovers k, so it does not use the experimental
parameter k. The Chow-Liu tree does not contain any hidden variables, hence there is just one number for it in
the table. CLNJ and Neighbor-joining assume that the hidden and observed variables have the same number of
states during structure learning. However, in parameter fitting, we can still use a different number
of hidden states k. In this experiment, the structure produced by our tensor approach gave
the best held-out likelihood.



7 Conclusion 

In this paper, we propose a quartet-based method for discovering the tree structures of latent
variable models. The practical advantage of the new method is that we do not need to pre-specify
the number of hidden states, a quantity usually unknown in practice. The key idea is to view
the joint probability tables of quadruples of variables as 4th order tensors and then use the spectral
properties of the unfolded tensors to design a quartet test. We provide conditions under which the
algorithm is consistent and its error probability decays exponentially with increasing sample
size. On both simulated data and a real dataset, we demonstrated the usefulness of our method for
discovering latent structures. While in this study we focus on the properties of the 4th order tensor
and its various unfoldings, we believe that properties of tensors and methods and algorithms from
multilinear algebra will allow us to address many other problems arising from latent variable models.
Unfolding Latent Tree Structures using 4th Order Tensors

Appendix 



8 Properties and Notations used 

Nuclear and Frobenius norms: 

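• Let $\sigma_i$ be the singular values of $A$. Then

$$\|A\|_* = \sum_i \sigma_i, \qquad \|A\|_F^2 = \sum_i \sigma_i^2 \qquad \text{and} \qquad \|A\|_F \le \|A\|_* . \tag{13}$$

• (Nuclear and Frobenius norms are unitarily invariant) For any orthogonal $Q$ we have

$$\|A\|_* = \|QA\|_* = \|AQ\|_*, \qquad \|A\|_F = \|QA\|_F = \|AQ\|_F . \tag{14}$$

• Let $\sigma_i$ be the singular values of $X$ and $\tilde{\sigma}_i$ be the singular values of $\tilde{X} = X + E$. Then

$$\|\mathrm{diag}(\tilde{\sigma}_i - \sigma_i)\|_* \le \|\tilde{X} - X\|_* . \tag{15}$$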

Kronecker and Khatri-Rao products: 

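$$(A \otimes B)^\top = A^\top \otimes B^\top \tag{16}$$
$$(A + B) \otimes C = A \otimes C + B \otimes C \tag{17}$$
$$AB \otimes CD = (A \otimes C)(B \otimes D) \tag{18}$$
$$AB \odot CD = (A \otimes C)(B \odot D) \tag{19}$$
$$\|A \otimes B\|_F = \|A\|_F \, \|B\|_F$$
$$\mathrm{rank}(A \otimes B) = \mathrm{rank}(A) \, \mathrm{rank}(B)$$

Here $\otimes$ denotes the Kronecker product and $\odot$ the Khatri-Rao (column-wise Kronecker) product.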
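
The identities above are easy to check numerically. The following is a minimal sketch on random matrices; `khatri_rao` is a small helper written here for illustration (not part of the paper), and the matrix sizes are arbitrary assumptions.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product: column j equals kron(a[:, j], b[:, j])."""
    if a.shape[1] != b.shape[1]:
        raise ValueError("both matrices need the same number of columns")
    return np.einsum("ij,kj->ikj", a, b).reshape(a.shape[0] * b.shape[0], a.shape[1])


rng = np.random.default_rng(0)
A, A2 = rng.random((3, 4)), rng.random((3, 4))
B = rng.random((4, 5))
C = rng.random((2, 4))
D = rng.random((4, 5))

checks = {
    "(16) transpose": np.allclose(np.kron(A, C).T, np.kron(A.T, C.T)),
    "(17) distributivity": np.allclose(np.kron(A + A2, C), np.kron(A, C) + np.kron(A2, C)),
    "(18) mixed product": np.allclose(np.kron(A @ B, C @ D), np.kron(A, C) @ np.kron(B, D)),
    "(19) Khatri-Rao": np.allclose(khatri_rao(A @ B, C @ D), np.kron(A, C) @ khatri_rao(B, D)),
    "Frobenius norm": np.isclose(np.linalg.norm(np.kron(A, C)),
                                 np.linalg.norm(A) * np.linalg.norm(C)),
    "rank": np.linalg.matrix_rank(np.kron(A, C))
            == np.linalg.matrix_rank(A) * np.linalg.matrix_rank(C),
}
for name, ok in checks.items():
    print(name, bool(ok))
```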

Tensor operations: 

We use the following tensor-matrix products of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ with matrices $M^{(n)} \in \mathbb{R}^{J_n \times I_n}$, $n = 1, 2, 3$:

mode-1 product: $(\mathcal{A} \bullet_1 M^{(1)})_{j_1 i_2 i_3} = \sum_{i_1=1}^{I_1} a_{i_1 i_2 i_3} \, m^{(1)}_{j_1 i_1}$,

mode-2 product: $(\mathcal{A} \bullet_2 M^{(2)})_{i_1 j_2 i_3} = \sum_{i_2=1}^{I_2} a_{i_1 i_2 i_3} \, m^{(2)}_{j_2 i_2}$,

mode-3 product: $(\mathcal{A} \bullet_3 M^{(3)})_{i_1 i_2 j_3} = \sum_{i_3=1}^{I_3} a_{i_1 i_2 i_3} \, m^{(3)}_{j_3 i_3}$,

where $1 \le i_n \le I_n$, $1 \le j_n \le J_n$. These products can be considered as a generalization of the left
and right multiplication of a matrix A with a matrix M. The mode-1 product signifies multiplying
the columns (mode-1 vectors) of $\mathcal{A}$ with the rows of $M^{(1)}$, and similarly for the other tensor-matrix
products.

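The contracted product $\mathcal{C}$ of two tensors $\mathcal{A} \in \mathbb{R}^{I \times J \times M}$ and $\mathcal{B} \in \mathbb{R}^{K \times L \times M}$ along their third
modes is a 4th order tensor denoted by $\mathcal{C} = \langle \mathcal{A}, \mathcal{B} \rangle_3$, $\mathcal{C} \in \mathbb{R}^{I \times J \times K \times L}$, whose entries $\mathcal{C}(i, j, k, l)$,
$1 \le i \le I$; $1 \le j \le J$; $1 \le k \le K$; $1 \le l \le L$, are defined as

$$\mathcal{C}(i, j, k, l) = \sum_{m=1}^{M} \mathcal{A}(i, j, m) \, \mathcal{B}(k, l, m) .$$

It can be interpreted as taking inner products of the mode-3 vectors of $\mathcal{A}$ and $\mathcal{B}$ and storing the
results in $\mathcal{C}$.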
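
As an illustration of the tensor-matrix products and the contracted product defined above, here is a minimal NumPy sketch. The function names `mode_product` and `contracted_product` are introduced here for illustration only, and modes are 0-based in the code (mode 0 corresponds to the mode-1 product in the text).

```python
import numpy as np


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along axis `mode` (0-based) by `matrix` of shape (J, I_mode)."""
    # tensordot puts the new axis (length J) first; move it back to position `mode`.
    moved = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(moved, 0, mode)


def contracted_product(a, b):
    """C(i, j, k, l) = sum_m a(i, j, m) * b(k, l, m), contracting the third modes."""
    return np.einsum("ijm,klm->ijkl", a, b)


rng = np.random.default_rng(0)
T = rng.random((2, 3, 4))
M2 = rng.random((5, 3))          # acts on the second mode (length 3)
X = rng.random((3, 4))           # an ordinary matrix, viewed as a 2nd order tensor
L, R = rng.random((6, 3)), rng.random((7, 4))

print(mode_product(T, M2, 1).shape)                  # second mode changes from 3 to 5
print(np.allclose(mode_product(X, L, 0), L @ X))     # left multiplication
print(np.allclose(mode_product(X, R, 1), X @ R.T))   # right multiplication
print(contracted_product(T, rng.random((5, 6, 4))).shape)
```

For matrices, the mode-1 and mode-2 products reduce to left multiplication by M and right multiplication by the transpose of M, which is the sense in which they generalize matrix multiplication.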

The 3 different reshapings A, B and C of the tensor $\mathcal{P}$ contain exactly the same entries
as $\mathcal{P}$ but in a different order.

• A corresponds to the grouping {{1, 2}, {3, 4}} of the variables. The rows of A correspond to
dimensions 1 and 2 of $\mathcal{P}$ and its columns to dimensions 3 and 4. Suppose all observed variables
take values from {1, . . . , n}; then the entry of A at the $(x_1 + n(x_2 - 1))$-th row and $(x_3 + n(x_4 - 1))$-th
column is equal to $\mathcal{P}(x_1, x_2, x_3, x_4)$;

• B corresponds to the grouping {{1, 3}, {2, 4}}, and its entry at the $(x_1 + n(x_3 - 1))$-th row and
$(x_2 + n(x_4 - 1))$-th column is equal to $\mathcal{P}(x_1, x_2, x_3, x_4)$;

• C corresponds to the grouping {{1, 4}, {2, 3}}, and its entry at the $(x_1 + n(x_4 - 1))$-th row and
$(x_2 + n(x_3 - 1))$-th column is equal to $\mathcal{P}(x_1, x_2, x_3, x_4)$.
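
The index conventions in the list above correspond to a permutation of axes followed by a column-major reshape. A minimal sketch, assuming the tensor is stored as a NumPy array with 0-based indices (so the 1-based row index $x_1 + n(x_2 - 1)$ becomes `x1 + n * x2`); `unfold` and `quartet_unfoldings` are names introduced here for illustration:

```python
import numpy as np


def unfold(p, rows, cols):
    """Matricize a 4th order tensor; within each group the first listed axis varies fastest."""
    n_rows = p.shape[rows[0]] * p.shape[rows[1]]
    # order="F" reads the permuted array with its first index changing fastest.
    return np.transpose(p, rows + cols).reshape((n_rows, -1), order="F")


def quartet_unfoldings(p):
    """Return the unfoldings A, B, C for the three groupings of the four variables."""
    a = unfold(p, (0, 1), (2, 3))   # grouping {{1, 2}, {3, 4}}
    b = unfold(p, (0, 2), (1, 3))   # grouping {{1, 3}, {2, 4}}
    c = unfold(p, (0, 3), (1, 2))   # grouping {{1, 4}, {2, 3}}
    return a, b, c


n = 2
p = np.arange(n ** 4, dtype=float).reshape(n, n, n, n)
a, b, c = quartet_unfoldings(p)
x1, x2, x3, x4 = 1, 0, 1, 1
print(a[x1 + n * x2, x3 + n * x4] == p[x1, x2, x3, x4])
print(b[x1 + n * x3, x2 + n * x4] == p[x1, x2, x3, x4])
print(c[x1 + n * x4, x2 + n * x3] == p[x1, x2, x3, x4])
```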

9 Matrix Representations A, B, C of $\mathcal{P}$

From $\mathcal{P}$ to A, B, C:

Let $X \in \mathbb{R}^{m \times k}$, $Y \in \mathbb{R}^{k \times l}$, $Z \in \mathbb{R}^{n \times l}$, $X = (x_1, \ldots, x_k)$ and $Z = (z_1, \ldots, z_l)$. A useful property that
we will use in our derivations is the following

$$X Y Z^\top = \sum_{i,j} x_i \, y_{ij} \, z_j^\top . \tag{20}$$

We can derive the formula for A starting from the element-wise formula (1)

$$\mathcal{P}(x_1, x_2, x_3, x_4) = \sum_{h,g} P(x_1|h) P(x_2|h) P(h, g) P(x_3|g) P(x_4|g)$$

and placing all entries in the matrix A in the correct order. Note that given h and g we only need
one column of each of $P_{1|H}$, $P_{2|H}$, $P_{3|G}$ and $P_{4|G}$, which we will denote by $(P_{1|H})_h$, $(P_{2|H})_h$, $(P_{3|G})_g$
and $(P_{4|G})_g$. In order to obtain a matrix such that $X_1$ and $X_2$ are mapped to rows and $X_3$ and
$X_4$ are mapped to columns, we need to map all possible products of a single element of $(P_{1|H})_h$ and a
single element of $(P_{2|H})_h$ to rows and, similarly, all possible products of a single
element of $(P_{3|G})_g$ and a single element of $(P_{4|G})_g$ to columns. This can be done using Khatri-Rao
products in the following way

$$\begin{aligned}
A &= \sum_{h,g} \left( (P_{2|H})_h \otimes (P_{1|H})_h \right) (P_{HG})_{hg} \left( (P_{4|G})_g \otimes (P_{3|G})_g \right)^\top \\
  &= (P_{2|H} \odot P_{1|H}) \, P_{HG} \, (P_{4|G} \odot P_{3|G})^\top ,
\end{aligned}$$

where the second equality follows from (20).
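The matrix B is the unfolding of $\mathcal{P}$ such that the rows of B correspond to $X_1$ and $X_3$ and the
columns of B correspond to $X_2$ and $X_4$. Writing $(P_{HG})_h$ for the h-th row of $P_{HG}$, we have

$$\begin{aligned}
B &= \sum_{h,g} \left( (P_{3|G})_g \otimes (P_{1|H})_h \right) (P_{HG})_{hg} \left( (P_{4|G})_g \otimes (P_{2|H})_h \right)^\top \\
  &= \sum_{h,g} \left( (P_{3|G})_g \otimes (P_{1|H})_h \right) (P_{HG})_{hg} \left( (P_{4|G})_g^\top \otimes (P_{2|H})_h^\top \right) \\
  &= \sum_{h,g} (P_{HG})_{hg} \left( (P_{3|G})_g (P_{4|G})_g^\top \right) \otimes \left( (P_{1|H})_h (P_{2|H})_h^\top \right) \\
  &= \sum_{h} \Big( \sum_{g} (P_{HG})_{hg} (P_{3|G})_g (P_{4|G})_g^\top \Big) \otimes \left( (P_{1|H})_h (P_{2|H})_h^\top \right) \\
  &= \sum_{h} \left( P_{3|G} \, \mathrm{diag}((P_{HG})_h) \, P_{4|G}^\top \right) \otimes \left( (P_{1|H})_h (P_{2|H})_h^\top \right) \\
  &= \sum_{h} \left( P_{3|G} \otimes (P_{1|H})_h \right) \mathrm{diag}((P_{HG})_h) \left( P_{4|G}^\top \otimes (P_{2|H})_h^\top \right) \\
  &= (P_{3|G} \otimes P_{1|H}) \, \mathrm{diag}(P_{HG}(:)) \, (P_{4|G}^\top \otimes P_{2|H}^\top) \\
  &= (P_{3|G} \otimes P_{1|H}) \, \mathrm{diag}(P_{HG}(:)) \, (P_{4|G} \otimes P_{2|H})^\top .
\end{aligned}$$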



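The expression for C is derived in a similar way.

Other representations of A, B, C:

Using the properties in Section 8 and the formulas (5)-(7) for the matrix unfoldings A, B and C,
we can derive the following additional formulas,

$$\begin{aligned}
A &= (P_{2|H} \odot P_{1|H}) \, P_{HG} \, (P_{4|G} \odot P_{3|G})^\top \\
  &= (I_n P_{2|H} \odot P_{1|H} I_H) \, P_{HG} \, (I_n P_{4|G} \odot P_{3|G} I_G)^\top \\
  &= (I_n \otimes P_{1|H}) (P_{2|H} \odot I_H) \, P_{HG} \, (P_{4|G} \odot I_G)^\top (I_n \otimes P_{3|G})^\top .
\end{aligned} \tag{21}$$

Written out explicitly, $I_n \otimes P_{1|H}$ and $I_n \otimes P_{3|G}$ are block diagonal with n copies of $P_{1|H}$ and $P_{3|G}$, respectively, while $P_{2|H} \odot I_H$ and $P_{4|G} \odot I_G$ stack the diagonal matrices formed from the rows of $P_{2|H}$ and $P_{4|G}$.

$$\begin{aligned}
B &= (P_{3|G} \otimes P_{1|H}) \, \mathrm{diag}(P_{HG}(:)) \, (P_{4|G} \otimes P_{2|H})^\top \\
  &= (P_{3|G} I_G \otimes I_n P_{1|H}) \, \mathrm{diag}(P_{HG}(:)) \, (P_{4|G} I_G \otimes I_n P_{2|H})^\top \\
  &= (P_{3|G} \otimes I_n) (I_G \otimes P_{1|H}) \, \mathrm{diag}(P_{HG}(:)) \, (I_G \otimes P_{2|H})^\top (P_{4|G} \otimes I_n)^\top .
\end{aligned} \tag{22}$$

Written out explicitly, $I_G \otimes P_{1|H}$ and $I_G \otimes P_{2|H}$ are block diagonal, and $P_{3|G} \otimes I_n$ and $P_{4|G} \otimes I_n$ consist of blocks $(p^{(i,j)})$, where $(p^{(i,j)})$ is a diagonal block of size $(n \times n)$ with all diagonal elements equal to $p^{(i,j)}$.

The formula for C can be obtained from the ones for B by swapping the positions of $P_{3|G}$ and
$P_{4|G}$.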

Rank properties of A, B, C:

In this section we prove the rank properties used in Section 3.2 of the paper.

Lemma. If $X \in \mathbb{R}^{m \times n}$, $Y \in \mathbb{R}^{n \times k}$, $Z \in \mathbb{R}^{l \times m}$, Y has full row rank, and Z has full column
rank, then

$$\mathrm{rank}(XY) = \mathrm{rank}(X), \qquad \mathrm{rank}(ZX) = \mathrm{rank}(X) .$$

We assume that all CPTs have full column (or row) rank. Then the first two matrices in (21)
also have full column rank. The last two matrices have full row rank. From the lemma, it follows
that

$$\mathrm{rank}(A) = \mathrm{rank}(P_{HG}) = k . \tag{23}$$

Analogously, the first two matrices in (22) have full column rank. The last two matrices have
full row rank. From the lemma, it follows that

$$\mathrm{rank}(B) = \mathrm{nnz}(P_{HG}),$$

i.e., generically,

$$\mathrm{rank}(B) = k^2 . \tag{24}$$
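
The closed-form expressions for A and B and the rank statements (23) and (24) can be checked numerically on a small synthetic model. The sketch below is only an illustration under assumed sizes (n = 4 observed states, k = 2 hidden states), with random CPTs and a hand-picked full-rank $P_{HG}$ without zero entries; `khatri_rao` and `unfold` are helpers introduced here.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product: column j equals kron(a[:, j], b[:, j])."""
    return np.einsum("ij,kj->ikj", a, b).reshape(a.shape[0] * b.shape[0], -1)


def unfold(p, rows, cols):
    """Matricize `p`; within each group the first listed axis varies fastest."""
    n_rows = p.shape[rows[0]] * p.shape[rows[1]]
    return np.transpose(p, rows + cols).reshape((n_rows, -1), order="F")


rng = np.random.default_rng(0)
n, k = 4, 2  # assumed sizes for this example
# Random CPTs P(x_i | parent): n x k matrices whose columns sum to one.
p1, p2, p3, p4 = (rng.dirichlet(np.ones(n), size=k).T for _ in range(4))
p_hg = np.array([[0.35, 0.15],
                 [0.10, 0.40]])  # joint table of (H, G): full rank, no zeros

# Element-wise formula (1): sum over the hidden states h and g.
p = np.einsum("ah,bh,hg,cg,dg->abcd", p1, p2, p_hg, p3, p4)
a = unfold(p, (0, 1), (2, 3))
b = unfold(p, (0, 2), (1, 3))

# Closed-form expressions for the unfoldings; P_HG(:) is the column-major vectorization.
a_formula = khatri_rao(p2, p1) @ p_hg @ khatri_rao(p4, p3).T
b_formula = np.kron(p3, p1) @ np.diag(p_hg.flatten(order="F")) @ np.kron(p4, p2).T

rank_a = np.linalg.matrix_rank(a, tol=1e-10)
rank_b = np.linalg.matrix_rank(b, tol=1e-10)
print(np.allclose(a, a_formula), np.allclose(b, b_formula))
print(rank_a, rank_b)
```

For generic CPTs such as these, the numerical ranks should equal k and $k^2$, in line with (23) and (24).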



10 Algorithms 



Algorithm 3 T_next = QuartetTree(T_1, T_2, T_3, X_4)
Require: Leaf(T): leaves of a tree T;
1: for i = 1 to 3 do
2:   X_i ← Randomly choose a variable from Leaf(T_i)
3: end for
4: i* ← Quartet(X_1, X_2, X_3, X_4), T_next ← T_{i*}
Algorithm 4 T = Insert(T, T', X_i)
Require: Left(T) and Right(T): left and right child branch of the root, respectively; T + T': return
a new tree connecting the roots of the two trees by an edge and use the root of T as the new root
1: if |Leaf(T)| = 1 then
2:   T ← Form a tree with root R connecting Leaf(T) and X_i
3: else
4:   T_next ← QuartetTree(Left(T), Right(T), T', X_i)
5:   if T_next = Left(T) then
6:     T ← Insert(T_next, Right(T) + T', X_i)
7:   else if T_next = Right(T) then
8:     T ← Insert(T_next, Left(T) + T', X_i)
9:   end if
10: end if
11: T ← T + T'



Algorithm 5 T = BuildTree({X_1, . . . , X_d})
1: Randomly choose X_1, X_2, X_3 and X_4
2: i* ← Quartet(X_1, X_2, X_3, X_4)
3: T ← Form a tree with two connecting hidden variables H and G, where H joins X_{i*} and X_4,
   while G joins the variables in {X_1, X_2, X_3} \ {X_{i*}}
4: for i = 5 to d do
5:   Pick a root R from T which splits it into three branches of equal sizes, and
     T_next ← QuartetTree(Left(T), Right(T), Middle(T), X_i)
6:   if T_next = Left(T) then
7:     T ← Insert(T_next, Right(T) + Middle(T), X_i)
8:   else if T_next = Right(T) then
9:     T ← Insert(T_next, Left(T) + Middle(T), X_i)
10:  else if T_next = Middle(T) then
11:    T ← Insert(T_next, Right(T) + Left(T), X_i)
12:  end if
13: end for



11 Recovery Conditions for Quartet 

Latent variables H and G are independent. In this case, $\mathrm{rank}(P_{HG}) = 1$, since $P(h, g) =
P(h)P(g)$. Applying the relation in Equation (8), we have that $\mathrm{rank}(A) = 1 \le \mathrm{rank}(B)$. Furthermore,
since A has only one nonzero singular value, and since A and B contain the same entries, we have
$\|A\|_* = \|A\|_F = \|B\|_F \le \|B\|_*$, where the last step uses
$\|M\|_F \le \|M\|_*$ for any matrix $M$. In this case, we know for sure that the nuclear norm quartet test will
return the correct topology.
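
The independent case can be illustrated numerically. The sketch below assumes that the nuclear norm quartet test selects the grouping whose unfolding has the smallest nuclear norm (consistent with the error event analyzed in Section 13); the function name `quartet_test`, the CPT values and the marginals are made up for this example.

```python
import numpy as np

# The three possible groupings of the four variables (0-based axes).
PAIRINGS = (((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2)))


def quartet_test(p):
    """Return the grouping whose unfolding of `p` has the smallest nuclear norm."""
    norms = []
    for rows, cols in PAIRINGS:
        n_rows = p.shape[rows[0]] * p.shape[rows[1]]
        # The ordering inside each group only permutes rows/columns,
        # which leaves the nuclear norm unchanged.
        unfolding = np.transpose(p, rows + cols).reshape(n_rows, -1)
        norms.append(np.linalg.norm(unfolding, ord="nuc"))
    return PAIRINGS[int(np.argmin(norms))], norms


# Hypothetical CPTs (columns sum to one) for X1, X2 | H and X3, X4 | G.
p1 = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
p2 = np.array([[0.6, 0.2], [0.3, 0.2], [0.1, 0.6]])
p3 = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])
p4 = np.array([[0.8, 0.2], [0.1, 0.3], [0.1, 0.5]])
p_h, p_g = np.array([0.6, 0.4]), np.array([0.3, 0.7])
p_hg = np.outer(p_h, p_g)  # H and G independent

p = np.einsum("ah,bh,hg,cg,dg->abcd", p1, p2, p_hg, p3, p4)
pairing, norms = quartet_test(p)
print(pairing)
print(norms)
```

With independent H and G, the first unfolding has rank one, so its nuclear norm equals the Frobenius norm of the tensor and cannot exceed the other two.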

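Latent variables H and G are not independent. We analyze this case by treating it as a
perturbation $\Delta$ away from the $P_{HG}$ of the independent case. We want to characterize how large $\Delta$
can be while still allowing the nuclear norm quartet test to find the correct latent relation. Suppose
$A_\perp$ and $B_\perp$ are the unfolding matrices in the case where H and G are independent, and suppose we add a
perturbation $\Delta$ to $P_{HG}$. Then $A_\perp = (P_{2|H} \odot P_{1|H}) \, P_{HG} \, (P_{4|G} \odot P_{3|G})^\top$ and its perturbed version is
$A = (P_{2|H} \odot P_{1|H}) (P_{HG} + \Delta) (P_{4|G} \odot P_{3|G})^\top$. We want to bound the difference $\left| \|A_\perp\|_* - \|A\|_* \right|$.
We have

$$\begin{aligned}
\left| \|A_\perp\|_* - \|A\|_* \right| &= \Big| \sum_i \sigma_i(A_\perp) - \sum_i \sigma_i(A) \Big| \\
 &\le \sum_i \left| \sigma_i(A_\perp) - \sigma_i(A) \right| \\
 &\le \left\| (P_{2|H} \odot P_{1|H}) \, \Delta \, (P_{4|G} \odot P_{3|G})^\top \right\|_* \qquad \text{(by (15))} \\
 &\le \|P_{2|H} \odot P_{1|H}\|_F \, \|\Delta\|_F \, \|P_{4|G} \odot P_{3|G}\|_F \\
 &\le k \, \|\Delta\|_F ,
\end{aligned}$$

since $P_{2|H} \odot P_{1|H}$ and $P_{4|G} \odot P_{3|G}$ are CPTs with k columns each, and thus $\|P_{2|H} \odot P_{1|H}\|_F^2 \le k$
and $\|P_{4|G} \odot P_{3|G}\|_F^2 \le k$.

Analogously, $B_\perp = (P_{3|G} \otimes P_{1|H}) \, \mathrm{diag}(P_{HG}(:)) \, (P_{4|G} \otimes P_{2|H})^\top$ and its perturbed version
is $B = (P_{3|G} \otimes P_{1|H}) \, \mathrm{diag}(P_{HG}(:) + \Delta(:)) \, (P_{4|G} \otimes P_{2|H})^\top$. We want to bound the difference
$\left| \|B_\perp\|_* - \|B\|_* \right|$. We have

$$\begin{aligned}
\left| \|B_\perp\|_* - \|B\|_* \right| &= \Big| \sum_i \sigma_i(B_\perp) - \sum_i \sigma_i(B) \Big| \\
 &\le \sum_i \left| \sigma_i(B_\perp) - \sigma_i(B) \right| \\
 &\le \|B_\perp - B\|_* \qquad \text{(by (15))} \\
 &= \left\| (P_{3|G} \otimes P_{1|H}) \, \mathrm{diag}(\Delta(:)) \, (P_{4|G} \otimes P_{2|H})^\top \right\|_* \\
 &\le \|P_{3|G} \otimes P_{1|H}\|_F \, \|\mathrm{diag}(\Delta(:))\|_F \, \|P_{4|G} \otimes P_{2|H}\|_F \\
 &\le k^2 \, \|\mathrm{diag}(\Delta(:))\|_F \\
 &= k^2 \, \|\Delta\|_F ,
\end{aligned}$$

since $P_{3|G} \otimes P_{1|H}$ and $P_{4|G} \otimes P_{2|H}$ are CPTs with $k^2$ columns, and thus $\|P_{3|G} \otimes P_{1|H}\|_F^2 \le k^2$ and
$\|P_{4|G} \otimes P_{2|H}\|_F^2 \le k^2$.

Therefore, we get the following upper and lower bounds:

$$\|A\|_* \le \|A_\perp\|_* + k \, \|\Delta\|_F , \qquad \|B\|_* \ge \|B_\perp\|_* - k^2 \, \|\Delta\|_F .$$

If we require that

$$\|A_\perp\|_* + k \, \|\Delta\|_F \le \|B_\perp\|_* - k^2 \, \|\Delta\|_F ,$$

then we will have

$$\|A\|_* \le \|B\|_* .$$

We can derive a similar condition for the relationship $\|A\|_* \le \|C\|_*$. Let

$$\theta := \min\{ \|B_\perp\|_* - \|A_\perp\|_* , \; \|C_\perp\|_* - \|A_\perp\|_* \} .$$

We thus obtain an upper bound on the allowed perturbation:

$$\|\Delta\|_F \le \frac{\theta}{k^2 + k} . \tag{25}$$

12 Recovery Conditions for Latent Tree

When latent variables H and G are independent, we have that $P_{HG} = P_H P_G^\top$. In this case

$$\begin{aligned}
\|B_\perp\|_* &= \left\| (P_{3|G} \otimes P_{1|H}) \left( \mathrm{diag}(P_G) \otimes \mathrm{diag}(P_H) \right) (P_{4|G} \otimes P_{2|H})^\top \right\|_* \\
 &= \left\| \left( P_{3|G} \, \mathrm{diag}(P_G) \, P_{4|G}^\top \right) \otimes \left( P_{1|H} \, \mathrm{diag}(P_H) \, P_{2|H}^\top \right) \right\|_* \\
 &= \| P_{34} \otimes P_{12} \|_*
\end{aligned} \tag{26}$$

and

$$\begin{aligned}
\|A_\perp\|_* &= \left\| (P_{2|H} \odot P_{1|H}) \, P_H P_G^\top \, (P_{4|G} \odot P_{3|G})^\top \right\|_* \\
 &= \left\| P_{12}(:) \, P_{34}(:)^\top \right\|_* \\
 &= \left\| P_{12}(:) \, P_{34}(:)^\top \right\|_F \\
 &= \| P_{34} \otimes P_{12} \|_F
\end{aligned} \tag{27}$$

and thus

$$\|A_\perp\|_* \le \|B_\perp\|_* .$$

Suppose now that H and G are not independent and thus we have $P_{HG} = P_H P_G^\top + \Delta$. The
goal is to characterize all $\Delta$s such that $\|A\|_* \le \|B\|_*$ still holds for any quartet. From the above
formulas it follows that the upper bound on $\Delta$ depends only on pairwise marginal distributions.

Since the perturbed version of $P_H P_G^\top$ remains a joint probability table, all entries of the
perturbation matrix $\Delta$ have to sum to 0, i.e., $\mathbf{1}^\top \Delta(:) = 0$. We further assume that each column sum
and each row sum of $\Delta$ is also equal to 0, i.e., $\mathbf{1}^\top \Delta = \mathbf{0}^\top$ and $\Delta \mathbf{1} = \mathbf{0}$. In this case, $\mathbf{1}^\top \Delta(:) = 0$ is
satisfied automatically.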
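
The perturbation bounds of Section 11 and the zero-sum assumption on $\Delta$ can be illustrated with a small numerical sketch. Everything below is an assumption made for the example: the sizes, the random CPTs, the marginals, and the use of double-centering to construct a perturbation whose rows and columns sum to zero.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product: column j equals kron(a[:, j], b[:, j])."""
    return np.einsum("ij,kj->ikj", a, b).reshape(a.shape[0] * b.shape[0], -1)


def nuc(m):
    """Nuclear norm (sum of singular values)."""
    return np.linalg.norm(m, ord="nuc")


rng = np.random.default_rng(1)
n, k = 3, 2
p1, p2, p3, p4 = (rng.dirichlet(np.ones(n), size=k).T for _ in range(4))
p_h, p_g = np.array([0.6, 0.4]), np.array([0.3, 0.7])
p_indep = np.outer(p_h, p_g)

# Double-centering a random matrix makes every row sum and column sum zero.
raw = rng.normal(size=(k, k))
delta = raw - raw.mean(axis=0, keepdims=True) - raw.mean(axis=1, keepdims=True) + raw.mean()
delta *= 0.05 / np.linalg.norm(delta)  # Frobenius norm of the perturbation is 0.05


def unfolding_a(p_hg):
    return khatri_rao(p2, p1) @ p_hg @ khatri_rao(p4, p3).T


def unfolding_b(p_hg):
    return np.kron(p3, p1) @ np.diag(p_hg.flatten(order="F")) @ np.kron(p4, p2).T


gap_a = abs(nuc(unfolding_a(p_indep)) - nuc(unfolding_a(p_indep + delta)))
gap_b = abs(nuc(unfolding_b(p_indep)) - nuc(unfolding_b(p_indep + delta)))
delta_norm = np.linalg.norm(delta)
print(gap_a, "<=", k * delta_norm)
print(gap_b, "<=", k ** 2 * delta_norm)
```

The observed gaps are typically much smaller than $k \|\Delta\|_F$ and $k^2 \|\Delta\|_F$, since those bounds are worst-case.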

The recovery conditions for latent trees can be derived in two steps. The first step is to
provide recovery conditions for those quartet relations corresponding to a single edge $H - G$ in
the tree (Figure 7, left). In the second step we study quartet relations corresponding to paths
$H - M_1 - M_2 - \cdots - M_l - G$ in the tree (Figure 7, right). We provide a condition under which
the recovery condition of such quartets is reduced to the recovery condition on quartets from step
1. That is, we provide a condition under which the perturbation on the path is guaranteed to be
smaller than the maximum allowed perturbation on an edge.

Let

$$\delta := \max_{H - G \text{ an edge}} \|\Delta_{HG}\|_F .$$

Our goal is to obtain conditions on $\delta$ under which recovery of any quartet relation is guaranteed.

Figure 7: Topologies of quartets corresponding to a single edge $H - G$ and to a path $H - M_1 - M_2 - \cdots - M_l - G$.



12.1 Quartets Corresponding to a Single Edge 

The first step is readily obtained from Section 11 if we assume that all CPTs (including $P_{X_{i_1}|H}$, $P_{X_{i_2}|H}$, $P_{X_{i_3}|G}$, $P_{X_{i_4}|G}$)
have full rank. Let $\theta_{\min} = \min_{\text{quartet } q} \theta_q$. From (25), we have

$$\delta \le \frac{\theta_{\min}}{k^2 + k} . \tag{28}$$



12.2 Quartets Corresponding to a Path 



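Path of independent latent variables. For the second step, we start again from the fully
factorized case (independent case). The joint probability table $P_{HG}$ of the two end points in a path
$H - M_1 - M_2 - \cdots - M_l - G$ is

$$\begin{aligned}
P_{HG} &= P_{H|M_1} P_{M_1|M_2} \cdots P_{M_l|G} \, \mathrm{diag}(P_G) \\
 &= P_{HM_1} \, \mathrm{diag}(P_{M_1})^{-1} P_{M_1 M_2} \, \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} P_{M_l G} \\
 &= P_H P_{M_1}^\top \mathrm{diag}(P_{M_1})^{-1} P_{M_1} P_{M_2}^\top \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} P_{M_l} P_G^\top \\
 &= P_H \left( P_{M_1}^\top \mathrm{diag}(P_{M_1})^{-1} \right) P_{M_1} \left( P_{M_2}^\top \mathrm{diag}(P_{M_2})^{-1} \right) \cdots \left( P_{M_l}^\top \mathrm{diag}(P_{M_l})^{-1} \right) P_{M_l} P_G^\top \\
 &= P_H \mathbf{1}^\top P_{M_1} \mathbf{1}^\top \cdots \mathbf{1}^\top P_{M_l} P_G^\top \\
 &= P_H P_G^\top ,
\end{aligned}$$

where we have used $P_{M_i}^\top \mathrm{diag}(P_{M_i})^{-1} = \mathbf{1}^\top$ and $\mathbf{1}^\top P_{M_i} = 1$.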

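Path of dependent latent variables. Next, we add perturbation matrices to the joint
probability tables associated with each edge $M_i - M_j$ in the tree and assume that the resulting
joint probability table $P_{M_i M_j} = P_{M_i} P_{M_j}^\top + \Delta_{ij}$ has full rank. Furthermore, we assume that the
resulting joint probability table $P_{HG}$ of the two end points in a path $H - M_1 - M_2 - \cdots - M_l - G$ also
has full rank. Writing $\Delta_i$ for the perturbation on the i-th edge of the path (a path with l intermediate
nodes has l + 1 edges), we have

$$\begin{aligned}
P_{HG} &= P_{H|M_1} P_{M_1|M_2} \cdots P_{M_l|G} \, \mathrm{diag}(P_G) \\
 &= P_{HM_1} \, \mathrm{diag}(P_{M_1})^{-1} P_{M_1 M_2} \, \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} P_{M_l G} \\
 &= (P_H P_{M_1}^\top + \Delta_1) \, \mathrm{diag}(P_{M_1})^{-1} (P_{M_1} P_{M_2}^\top + \Delta_2) \, \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} (P_{M_l} P_G^\top + \Delta_{l+1}) \\
 &= P_H P_{M_1}^\top \mathrm{diag}(P_{M_1})^{-1} P_{M_1} P_{M_2}^\top \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} P_{M_l} P_G^\top \\
 &\quad + (\text{terms not involving all the } \Delta\text{s, which will all be zero}) \\
 &\quad + \Delta_1 \, \mathrm{diag}(P_{M_1})^{-1} \Delta_2 \, \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} \Delta_{l+1} \\
 &= P_H P_G^\top + \Delta_1 \, \mathrm{diag}(P_{M_1})^{-1} \Delta_2 \, \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} \Delta_{l+1} .
\end{aligned} \tag{29}$$

The reason why we do not need to perturb the term $\mathrm{diag}(P_{M_i})^{-1}$ is that if $\tilde{P}_{M_i}$ is the perturbed
marginal, then

$$\tilde{P}_{M_i} = P_{M_i M_j} \mathbf{1} = (P_{M_i} P_{M_j}^\top + \Delta_{ij}) \mathbf{1} = P_{M_i} P_{M_j}^\top \mathbf{1} + \Delta_{ij} \mathbf{1} = P_{M_i} ,$$

since $\Delta_{ij} \mathbf{1} = \mathbf{0}$. And the reason why terms not involving all the $\Delta$s will all be zero is that such
terms contain either $\mathbf{1}^\top \Delta = \mathbf{0}^\top$ or $\Delta \mathbf{1} = \mathbf{0}$.

Now, from (29) it follows that the perturbation corresponding to the path $H - M_1 - M_2 - \cdots - M_l - G$ is

$$\Delta := \Delta_1 \, \mathrm{diag}(P_{M_1})^{-1} \Delta_2 \, \mathrm{diag}(P_{M_2})^{-1} \cdots \mathrm{diag}(P_{M_l})^{-1} \Delta_{l+1} . \tag{30}$$

Bounding the perturbation on the path. We still need to show under which condition $\Delta$
from (30) will satisfy $\|\Delta\|_F \le \delta$. Assume that the smallest entry in a marginal distribution of an
internal node is bounded from below by $\gamma_{\min}$, i.e.,

$$\gamma_{\min} := \min_{\text{hidden node } H} \; \min_i P_H(i) .$$

Then we have

$$\begin{aligned}
\|\Delta\|_F &= \left\| \Delta_1 \, \mathrm{diag}(P_{M_1})^{-1} \Delta_2 \, \mathrm{diag}(P_{M_2})^{-1} \cdots \Delta_{l+1} \right\|_F \\
 &\le \left\| \Delta_1 \, \mathrm{diag}(P_{M_1})^{-1} \right\|_F \left\| \Delta_2 \, \mathrm{diag}(P_{M_2})^{-1} \right\|_F \cdots \|\Delta_{l+1}\|_F \\
 &\le \frac{\delta^{l+1}}{\gamma_{\min}^{l}} .
\end{aligned}$$

The perturbation $\Delta$ on the path $H - M_1 - M_2 - \cdots - M_l - G$ is bounded by $\delta$ if $\delta^{l+1} / \gamma_{\min}^{l} \le \delta$, i.e., if

$$\delta \le \gamma_{\min} . \tag{31}$$

From (28) and (31) we arrive at the condition for a successful quartet test for all quartets:

$$\delta \le \min\left\{ \frac{\theta_{\min}}{k^2 + k} , \; \gamma_{\min} \right\} .$$

Intuitively, it means that the size of the perturbation $\delta$ away from independence cannot be too
large. In particular, it has to be small compared to the smallest marginal probability $\gamma_{\min}$ of a
hidden state; it also has to be small compared to the smallest excessive dependence $\theta_{\min}$.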
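
The cancellation behind Equation (29) can be verified numerically for the shortest non-trivial path H - M - G (one intermediate node, two edges). The sketch below is only an illustration under assumed marginals and perturbation sizes; `zero_sum_perturbation` is a helper introduced here, and $\gamma_{\min}$ is taken over the single intermediate node of this toy path.

```python
import numpy as np


def zero_sum_perturbation(rng, k, scale):
    """Random k x k matrix with zero row and column sums and Frobenius norm `scale`."""
    raw = rng.normal(size=(k, k))
    centered = (raw - raw.mean(axis=0, keepdims=True)
                - raw.mean(axis=1, keepdims=True) + raw.mean())
    return scale * centered / np.linalg.norm(centered)


rng = np.random.default_rng(2)
k = 3
p_h = np.array([0.5, 0.3, 0.2])
p_m = np.array([0.4, 0.4, 0.2])
p_g = np.array([0.2, 0.3, 0.5])

d1 = zero_sum_perturbation(rng, k, 0.02)   # perturbation on edge H - M
d2 = zero_sum_perturbation(rng, k, 0.02)   # perturbation on edge M - G
p_hm = np.outer(p_h, p_m) + d1
p_mg = np.outer(p_m, p_g) + d2

# Joint table of the end points: P_HG = P_HM diag(P_M)^{-1} P_MG.
inv_m = np.diag(1.0 / p_m)
p_hg = p_hm @ inv_m @ p_mg
path_delta = d1 @ inv_m @ d2               # perturbation on the path, as in (30)

delta_max = max(np.linalg.norm(d1), np.linalg.norm(d2))
gamma_min = p_m.min()
print(np.allclose(p_hg, np.outer(p_h, p_g) + path_delta))
print(np.linalg.norm(path_delta), "<=", delta_max ** 2 / gamma_min)
```

Because the edge perturbations here are smaller than $\gamma_{\min}$, the perturbation on the path is no larger than the perturbation on a single edge.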
13 Statistical Guarantee for the Quartet Test

Based on the concentration result for nuclear norm in (llip . we have that, given m samples, the 
probability that the finite sample nuclear norm deviates from its true quantity by e := ^^=- is 
bounded 

FjPII* > Mil* +e| < 2e ^ and P 1 < - e| < 2e — ^, (32) 

2 

where we have used r = . Now we can derive the probability of making an error for individual 
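quartet test. First, let $q = \{\{i_1, i_2\}, \{i_3, i_4\}\}$ and

$$\alpha = \min\{ \|B(q)\|_* - \|A(q)\|_* , \; \|C(q)\|_* - \|A(q)\|_* \} .$$

Then, for sufficiently large m, we can bound the error probability by

$$\begin{aligned}
& P\{\text{Quartet test returns incorrect result}\} \\
&= P\{ \|\hat{A}\|_* \ge \|\hat{B}\|_* \text{ or } \|\hat{A}\|_* \ge \|\hat{C}\|_* \} \\
&\le P\{ \|\hat{A}\|_* \ge \|\hat{B}\|_* \} + P\{ \|\hat{A}\|_* \ge \|\hat{C}\|_* \} \qquad \text{(union bound)} \\
&= P\{ \|\hat{A}\|_* - \|A\|_* + \|B\|_* - \|\hat{B}\|_* \ge \|B\|_* - \|A\|_* \} \\
&\quad + P\{ \|\hat{A}\|_* - \|A\|_* + \|C\|_* - \|\hat{C}\|_* \ge \|C\|_* - \|A\|_* \} \\
&\le P\Big\{ \|\hat{A}\|_* - \|A\|_* \ge \tfrac{\|B\|_* - \|A\|_*}{2} \Big\} + P\Big\{ \|B\|_* - \|\hat{B}\|_* \ge \tfrac{\|B\|_* - \|A\|_*}{2} \Big\} \\
&\quad + P\Big\{ \|\hat{A}\|_* - \|A\|_* \ge \tfrac{\|C\|_* - \|A\|_*}{2} \Big\} + P\Big\{ \|C\|_* - \|\hat{C}\|_* \ge \tfrac{\|C\|_* - \|A\|_*}{2} \Big\} \\
&\le P\{ \|\hat{A}\|_* - \|A\|_* \ge \tfrac{\alpha}{2} \} + P\{ \|B\|_* - \|\hat{B}\|_* \ge \tfrac{\alpha}{2} \} \\
&\quad + P\{ \|\hat{A}\|_* - \|A\|_* \ge \tfrac{\alpha}{2} \} + P\{ \|C\|_* - \|\hat{C}\|_* \ge \tfrac{\alpha}{2} \} .
\end{aligned}$$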

14 Statistical Guarantee for the Tree Building Algorithm 

Let $\alpha_q = \min\{ \|B(q)\|_* - \|A(q)\|_* , \; \|C(q)\|_* - \|A(q)\|_* \}$. We define

$$\alpha_{\min} = \min_{\text{quartet } q} \alpha_q .$$

For a latent tree with d observed variables, the tree building algorithm described in the paper
requires $O(d \log d)$ calls to the quartet test procedure. The probability that the tree is constructed
incorrectly is bounded by the probability that at least one of these quartet tests returns an incorrect
result. That is,

$P\{\text{The latent tree is constructed incorrectly}\}$

$\le P\{\text{At least one of the } O(d \log d) \text{ quartet tests returns an incorrect result}\}$

$\le c \cdot d \log d \cdot P\{\text{quartet test returns incorrect result}\}$ (union bound)

< 8c • d log a • e 32 ^ 

which implies that the probability of constructing the tree incorrectly decreases exponentially fast 
as we increase the number of samples m. 